feat(x/evm): migrate x/evm metrics to OpenTelemetry Meter API#3423
Conversation
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3423 +/- ##
=======================================
Coverage 59.23% 59.23%
=======================================
Files 2116 2118 +2
Lines 175440 175463 +23
=======================================
+ Hits 103916 103932 +16
- Misses 62460 62463 +3
- Partials 9064 9068 +4
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
| else \ | ||
| DETACH_FLAG=""; \ | ||
| fi; \ | ||
| DOCKER_PLATFORM=$(DOCKER_PLATFORM) USERID=$(shell id -u) GROUPID=$(shell id -g) GOCACHE=$(shell go env GOCACHE) NUM_ACCOUNTS=10 INVARIANT_CHECK_INTERVAL=${INVARIANT_CHECK_INTERVAL} UPGRADE_VERSION_LIST=${UPGRADE_VERSION_LIST} MOCK_BALANCES=${MOCK_BALANCES} GIGA_EXECUTOR=${GIGA_EXECUTOR} GIGA_OCC=${GIGA_OCC} RECEIPT_BACKEND=${RECEIPT_BACKEND} AUTOBAHN=${AUTOBAHN} GIGA_STORAGE=${GIGA_STORAGE} docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up --no-attach grafana --no-attach prometheus $$DETACH_FLAG |
There was a problem hiding this comment.
Grafana and Prometheus were adding a lot of noise to the logs so added --no-attach flag. Their logs can still be viewed with docker logs sei-grafana and docker logs sei-prometheus
| evaluation_interval: 15s | ||
|
|
||
| scrape_configs: | ||
| - job_name: 'cryptosim' |
There was a problem hiding this comment.
I will rollback these changes to existing prometheus and grafana yaml files and create new configs.
| apiVersion: 1 | ||
| providers: | ||
| - name: default | ||
| orgId: 1 | ||
| folder: "" | ||
| type: file | ||
| disableDeletion: false | ||
| updateIntervalSeconds: 30 | ||
| options: | ||
| path: /var/lib/grafana/dashboards | ||
| foldersFromFilesStructure: false |
There was a problem hiding this comment.
I don't have context on these dashboards here. Are these to spin up dashboards for the docker compose personal stack? Seems like yes but want to confirm.
We're creating a mirror in the platform deployment, is that right?
There was a problem hiding this comment.
Actually these are for local (docker) environment. So the idea is having a new target in MakeFile so we can run make docker-cluster-start-monitoring locally with docker compose and have a prometheus and grafana local containers and Grafana user interface to investigate metrics, do performance measurement of our changes without having to release them.
We currently don't have a clean setup locally for this. We can run scripts like ./docker/monitornode/scripts/start-prometheus.sh but their ports conflict with existing setup and we need to shut them down manually.
| @@ -0,0 +1,8 @@ | |||
| apiVersion: 1 | |||
There was a problem hiding this comment.
Lots of docker compose pieces here. I'm slightly leaning towards us not creating these and just supporting them on the new platform Grafana. Seems like complexity & scope that isn't worth it's weight to me although I like that we are thinking about the tooling end of this.
There was a problem hiding this comment.
Thanks for feedback. I think these help with testing locally and iterating quickly for measuring performance changes against a metric. If that is ok, I would rather keep it 😊
| // this nonce has already been mined, we cannot accept it again | ||
| metrics.IncrementPendingNonce("rejected") | ||
| utilmetrics.IncrementPendingNonce("rejected") // TODO(PLT-330): remove once evm_pending_nonce_total verified | ||
| evmAnteMetrics.pendingNonce.Add(ctx.Context(), 1, otelmetric.WithAttributes(attribute.String("event", "rejected"))) |
There was a problem hiding this comment.
nit: you could define the attributes once and reuse them since they are the same each time. Prevents creating a now struct each time to represent this and just reuses an immutable instance.
Should apply to all of these that have a known set of possible values ahead of time.
| cause := "too_low" | ||
| if tooHigh { | ||
| cause = "too_high" |
There was a problem hiding this comment.
nit: could simplify to "lower" and "higher" just too simplify use for dashboards or tools.
| func mustAnteMetric[V any](v V, err error) V { | ||
| if err != nil { | ||
| panic(err) | ||
| } | ||
| return v |
There was a problem hiding this comment.
nit: you're doing this but in different ways across packages. Consider consolidating into a single place for all your metrics structs to reuse.
There was a problem hiding this comment.
Thanks for feedback. Per discussion with @masih , it seems in GO the emphasis is on packages being independent and not have coupling for code reuse.
| # Start 4-node cluster with Prometheus and Grafana monitoring | ||
| docker-cluster-start-monitoring: docker-cluster-stop build-docker-node | ||
| @rm -rf $(PROJECT_HOME)/build/generated | ||
| @mkdir -p $(shell go env GOPATH)/pkg/mod | ||
| @mkdir -p $(shell go env GOCACHE) | ||
| @cd docker && \ | ||
| if [ "$${DOCKER_DETACH:-}" = "true" ]; then \ | ||
| DETACH_FLAG="-d"; \ | ||
| else \ | ||
| DETACH_FLAG=""; \ | ||
| fi; \ | ||
| DOCKER_PLATFORM=$(DOCKER_PLATFORM) USERID=$(shell id -u) GROUPID=$(shell id -g) GOCACHE=$(shell go env GOCACHE) NUM_ACCOUNTS=10 INVARIANT_CHECK_INTERVAL=${INVARIANT_CHECK_INTERVAL} UPGRADE_VERSION_LIST=${UPGRADE_VERSION_LIST} MOCK_BALANCES=${MOCK_BALANCES} GIGA_EXECUTOR=${GIGA_EXECUTOR} GIGA_OCC=${GIGA_OCC} RECEIPT_BACKEND=${RECEIPT_BACKEND} AUTOBAHN=${AUTOBAHN} GIGA_STORAGE=${GIGA_STORAGE} docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up --no-attach grafana --no-attach prometheus $$DETACH_FLAG | ||
| .PHONY: docker-cluster-start-monitoring | ||
|
|
||
| # Stop monitoring containers (Prometheus and Grafana) and cluster | ||
| docker-cluster-stop-monitoring: | ||
| @cd docker && DOCKER_PLATFORM=$(DOCKER_PLATFORM) USERID=$(shell id -u) GROUPID=$(shell id -g) GOCACHE=$(shell go env GOCACHE) docker compose -f docker-compose.yml -f docker-compose.monitoring.yml down | ||
| .PHONY: docker-cluster-stop-monitoring |
There was a problem hiding this comment.
Any issues with testing this in your Harbour personal stack? Let me know if I can help. Ideally it's easy enough that you can use that instead of rolling net-new infra like this to test your changes
There was a problem hiding this comment.
I believe this local version could allow faster iterations and experiments, to make sure there is value in the optimization example being made. Once we confirm, we can move to Harbour and test that modification in a more prod like environment. It also can lower our AWS cost to not take it to Harbour when there is no observed improvement.
96669ca to
6eb1415
Compare
6eb1415 to
f9e0114
Compare
2d4f018 to
d2c9c1b
Compare
masih
left a comment
There was a problem hiding this comment.
Left a couple of recommendations. This PR is still a PR to a non-main branch. Could we have this based against main please for the final review?
Thanks Amir!
| @@ -0,0 +1,8 @@ | |||
| apiVersion: 1 | |||
There was a problem hiding this comment.
@amir-deris could you move the docker grafana stuff to a separate PR please?
There was a problem hiding this comment.
Yes sir! will do that.
| utilmetrics.IncrementNonceMismatch(tooHigh) // TODO(PLT-330): remove once evm_nonce_mismatch_total verified | ||
| cause := "lower" | ||
| if tooHigh { | ||
| cause = "higher" |
There was a problem hiding this comment.
Turn tooHigh into a bool attribute instead if stringify-ing it?
…eus containers with docker compose
…ing existing ones
…d fixing gas big int overflow issue
4b6883f to
a5c2a10
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit ba7a653. Configure here.
| "evm_pending_nonce", | ||
| metric.WithDescription("EVM pending nonce events by type (added, expired, rejected, accepted)"), | ||
| metric.WithUnit("{count}"), | ||
| )), |
There was a problem hiding this comment.
Unused pendingNonce metric registered but never emitted
Low Severity
The pendingNonce counter is registered in evmAnteMetrics (allocating OTel resources at init) but evmAnteMetrics.pendingNonce is never referenced anywhere in the codebase. The corresponding legacy function IncrementPendingNonce in utils/metrics is also never called. This is dead code that registers a metric instrument at startup for no purpose. The same pattern exists in app/ante/metrics.go with app_pending_nonce, which is also unused.
Reviewed by Cursor Bugbot for commit ba7a653. Configure here.


Summary
Migrates telemetry in the
x/evm/keeperandx/evm/antepackages from the legacytelemetry/utilmetricshelpers to the standardized OpenTelemetry Meter API, following the same pattern established in #3396 forappandapp/ante.x/evm/keeper/metrics.gowith a struct-based OTel instrument set registered viaotel.Meter("evm_keeper"), initialized once viaInitEvmKeeperMetrics()called fromapp.goafterSetupOtelMetricsProvider()x/evm/ante/metrics.gowith anevmAnteMetricsstruct viaotel.Meter("evm_ante"), initialized once viaInitEvmAnteMetrics()telemetry.ModuleMeasureSince/utilmetrics/seimetrics.SafeTelemetry*calls in all target files with dual-emitted OTel instruments; legacy calls remain withTODO(PLT-330)markers until dashboards are migratedctx.Context()/goCtxthrough all.Record()and.Add()call sitesevm_association_error_totalcounter that was registered by both keeper and ante intoevm_keeper_association_error_totalandevm_ante_association_error_totalNew metrics (OTel naming convention, exported via the process-wide MeterProvider)
ABCI phase durations — histograms bucketed at fine-grained ms thresholds:
evm_abci_begin_blocker_duration_secondsevm_abci_end_blocker_duration_secondsBlock fee:
evm_block_base_fee— synchronous gauge set each EndBlockEVM transaction counters:
evm_panics_totalevm_errors_total— bytypelabel (state_transition,stateDB_finalize,write_receipt,apply_message,vm_execution)evm_receipt_status_total— bystatuslabel (success,failed)Association errors:
evm_keeper_association_error_total— byscenarioandtypelabels (emitted from keeper)evm_ante_association_error_total— byscenarioandtypelabels (emitted from ante)Zero-storage cleanup:
evm_zero_storage_processed_keys_totalevm_zero_storage_pruned_keys_totalevm_zero_storage_pruned_bytes_totalAnte nonce/fee tracking:
evm_pending_nonce_total— byeventlabel (added,expired,rejected,accepted)evm_nonce_mismatch_total— bytoo_highbool attribute (true= nonce above expected,false= nonce below expected)evm_effective_gas_price— histogram of effective gas price per EVM transactionNote
Medium Risk
Touches EVM ante/keeper hot paths (BeginBlock/EndBlock, tx processing, nonce/fee checks) to add new OTel metric emissions; risk is mainly around performance/label correctness and potential numeric conversions for gauges/histograms.
Note
Medium Risk
Touches EVM ante/keeper hot paths (tx processing and BeginBlock/EndBlock) to add new OpenTelemetry metric emissions alongside legacy metrics; risk is mainly performance overhead and correctness of labels/value conversions (big.Int→float64).
Overview
Adds new OpenTelemetry Meter instrument sets for
x/evm/anteandx/evm/keeper(newmetrics.gofiles) and wires them into EVM hot paths.Keeper now records BeginBlock/EndBlock duration histograms, base-fee gauge values (with safe
big.Int→float64conversion), tx panic/error/receipt-status counters, association-error counters, and zero-storage cleanup counters; ante now records effective gas price histograms plus nonce-mismatch and association-error counters.Legacy
utils/metrics/SafeTelemetry*calls remain in place but are renamed/aliased and annotated withTODO(PLT-330)while dashboards migrate.Reviewed by Cursor Bugbot for commit ba7a653. Bugbot is set up for automated code reviews on this repo. Configure here.